An Unsupervised Approach for Product Record Normalization across Different Web Sites
Authors
Abstract
An unsupervised probabilistic learning framework for normalizing product records across different retailer Web sites is presented. The framework decomposes the problem into two tasks. The first task extracts attribute values of products from different sites and normalizes them into appropriate reference attributes. This task is challenging because the set of reference attributes is unknown in advance and because layout formats differ across Web sites. The second task, product record normalization, identifies product records that refer to the same reference product, based on the results of the first task. We develop a graphical model for the generation of text fragments in Web pages to accomplish both tasks. One characteristic of our model is that the product attributes to be extracted need not be specified in advance, and an unlimited number of previously unseen product attributes can be handled. We compare our framework with existing methods. Extensive experiments using over 300 Web pages from over 150 real-world Web sites in three different domains demonstrate the effectiveness of our framework.

Introduction

The readily accessible Internet provides a convenient and cost-saving environment for both retailers and consumers. Many retailers have set up Web sites containing catalogs of products, and consumers can shop around by browsing these sites. Recently, several specialized search engines have been developed for users to search and compare products from different retailer Web sites.¹ Such systems can help users match the same product across retailer sites and find the best deal. One limitation of such systems is that retailers are required to manually input the value of each product attribute into the search engine's database via an interface.
This may lead to out-of-date product information, resulting in degradation of user satisfaction. Moreover, the attributes of products in these systems are just simple and general fields pre-defined by the search engines, for example, brand name, model number, brief description, and price. However, there may exist domain specific attributes, such as the attribute “resolution” in the digital camera domain. These domain specific attributes are important because they can help users retrieve, analyze, and compare the products. Another problem is that it is not easy to determine whether product records from different Web sites refer to the same product. For instance, some Web sites may use different model numbers, or no model number at all, for the same digital camera. Examples of such situations can also be observed in the data set used in our experiments. Figures 1 and 2 depict two Web pages collected from two different retailer Web sites. Both pages describe the same product, but under different product names. Resolving such product records requires human effort and expert knowledge.

Figure 1: A sample of a portion of a Web page showing a digital camera collected from a retailer Web site. (Web site URL: http://www.superwarehouse.com)

∗ The work described in this paper is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK4193/04E and CUHK4128/07) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes: 2050363 and 2050391). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ Examples are Google Product Search (http://www.google.com/products/) and Shopping.com (http://www.shopping.com).
Product record normalization is defined as the clustering of the same or similar product records into the same group. Bilenko et al. proposed a product normalization approach that computes the similarity between products stored in a structured database (Bilenko, Basu, and Sahami 2005). Their method uses a linear combination of basic similarity functions, one for each field of the records. One limitation of this approach is that it requires structured records with a fixed set of attributes. As a result, the attribute values of each product need to be extracted manually in advance. Alternatively, attribute values may be extracted from Web sites automatically by making use of wrappers (Turmo, Ageno, and Catala 2006; Chang et al. 2006; Zhao, Meng, and Yu 2007; Sarawagi and Cohen 2004; Viola and Narasimhan 2005; Wong and Lam 2007). However, a wrapper learned for one Web site cannot be applied to other sites for information extraction because the layout formats are different. Consequently, a separate wrapper must be learned for each Web site, and training examples are required for every site. As a result, this approach is infeasible for handling numerous retailer Web sites. Another limitation of the approach proposed by Bilenko et al. is that human effort is needed to prepare training examples of product normalization for learning the weights in the linear combination.

Figure 2: A sample of a portion of a Web page showing the same digital camera as the one depicted in Figure 1, but collected from a different retailer Web site. (Web site URL: http://www.crayeon3.com)

Product normalization bears a certain resemblance to the research area of duplicate detection, or record linkage, in databases (Bilenko and Mooney 2003; Sarawagi and Bhamidipaty 2002; Ravikumar and Cohen 2004; Culotta et al. 2007). However, these approaches aim at matching records that have a fixed set of attributes in a database.
Therefore, they are not applicable to our problem, in which attributes of products can be previously unseen and the number of attributes is unknown.

In this paper, we aim at developing an unsupervised framework which can automatically conduct product record normalization across different retailer Web sites. To achieve this, our framework also extracts and normalizes the domain specific attribute values of products. This can help users analyze the products, and it is particularly useful when products have no identifier. We develop a probabilistic graphical model of the generation of text fragments in Web pages. Based on this model, our framework decomposes product record normalization into two tasks. The first task is product attribute value extraction and normalization. It aims at automatically extracting text fragments related to domain specific attributes from Web pages and clustering them into appropriate reference attributes. One characteristic of our approach is that it can handle Web pages with different layout formats. Another is that it can handle previously unseen attributes and an unlimited number of attributes. The second task is product record normalization. We tackle this task by considering the similarity between products based on the results of the first task. Product record normalization is then accomplished by another level of unsupervised learning. We have conducted extensive experiments using over 150 real-world retailer Web sites from three different domains. We have compared our framework with existing methods, and the results demonstrate the effectiveness of our approach.

Problem Definition

Consider a collection of reference products P in a domain D. Each product p_i ∈ P is characterized by the values of a set of reference attributes A. For example, in the digital camera domain, reference attributes may include “resolution”, “sensor type”, etc.
The product shown in Figure 1 has a value of “10 Megapixel” for the reference attribute “resolution”. We let v_i be the value of the attribute a_i ∈ A for the reference product p. Notice that A is domain specific and the number of elements in A is unknown.

Suppose we have a collection of product records R, which are realizations of some products p ∈ P. For example, Figures 1 and 2 show two different product records. We let r_i be the i-th product record in R, and r_i.U = p if r_i is a realization of the reference product p. Notice that each reference product p may have several product records, while a product record r can only be a realization of one particular reference product. For example, the product records in Figures 1 and 2 refer to the same reference product. For each product record r, we let v_i(r) be the realization, in the product record r, of the value of the reference attribute a_i of the product p. For example, the values of the reference attribute “light sensitivity” are “Auto, High ISO, ISO 80/100/200/400/8000/1600, equivalent” and “Auto ISO 80/100/200/400/800/1600” in Figures 1 and 2 respectively.

We consider a collection of Web pages C collected from a collection of Web sites S. Each Web page c ∈ C contains a product record r. The Web page c can be considered as a set of text fragments X. For example, “Features” and “10 Megapixel” are samples of text fragments in the Web page shown in Figure 1. Each text fragment x ∈ X may refer to an attribute value v_i(r). We let x.T = 1 if x refers to an attribute value of a product record, and x.T = 0 otherwise. Moreover, we have x.A = a_i if x refers to the reference attribute a_i ∈ A. We also define x.C and x.L as the content and the layout format of the text fragment x respectively. For example, the content of the text fragment “10 Megapixel” in Figure 1 can be the terms it contains. The layout format of the same text fragment can be its color, font size, etc.
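The notation above can be read as a small data model. The following is a minimal Python sketch of it, not the authors' implementation: the class names, field names, and the layout keys (`font_weight`, `font_size`) are hypothetical choices made for illustration, and `None` is used here to mark a quantity that has not been inferred yet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TextFragment:
    """One text fragment x of a Web page (all names here are illustrative)."""
    content: str                                  # x.C: the terms in the fragment
    layout: Dict[str, str]                        # x.L: e.g. color, font size
    is_attribute_value: Optional[bool] = None     # x.T: None until inferred
    reference_attribute: Optional[str] = None     # x.A: None until inferred


@dataclass
class ProductRecord:
    """One product record r, found on a single Web page c."""
    page_url: str
    fragments: List[TextFragment] = field(default_factory=list)
    reference_product: Optional[str] = None       # r.U: unknown in advance

    def attribute_values(self) -> Dict[str, List[str]]:
        """Group the fragments with x.T = 1 by their reference attribute x.A."""
        values: Dict[str, List[str]] = {}
        for x in self.fragments:
            if x.is_attribute_value and x.reference_attribute is not None:
                values.setdefault(x.reference_attribute, []).append(x.content)
        return values


# The two fragments named in the text for Figure 1; layout values are made up.
record = ProductRecord(
    page_url="http://www.superwarehouse.com",
    fragments=[
        TextFragment("Features", {"font_weight": "bold"}, False),
        TextFragment("10 Megapixel", {"font_size": "12px"}, True, "resolution"),
    ],
)
print(record.attribute_values())
```

In this sketch the fields for x.T, x.A, and r.U default to `None` because they are exactly the quantities the framework has to infer, whereas the content and layout fields are always filled in from the page.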
Notice that x.T and x.A are unobservable, whereas x.C and x.L are observable. As a result, we can define product record normalization as follows:

Product record normalization: Given two product records r_{c_i} and r_{c_j} contained in Web pages c_i and c_j, product record normalization aims at predicting whether r_{c_i}.U = r_{c_j}.U.

To support this, the attribute values of the reference attributes of a product record must be determined in advance. Therefore, we define attribute extraction and normalization as follows:

Attribute extraction: Given a collection of Web pages C, where each page c ∈ C contains a record r ∈ R, the goal of attribute extraction is to discover all text fragments x ∈ X such that x.T = 1, given x.C and x.L.

Attribute normalization: Given a collection of text fragments such that x.T = 1 for all text fragments in the collection ...
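To make the product record normalization task concrete, here is a rough Python sketch of deciding whether two records refer to the same reference product once their attribute values have been extracted and normalized. It is not the paper's graphical model: the token-level Jaccard similarity, the averaging over shared attributes, and the 0.5 threshold are all assumptions chosen only for illustration. The two attribute values are the “light sensitivity” strings quoted in the text.

```python
import re
from typing import Dict, Set


def tokens(value: str) -> Set[str]:
    """Lowercase an attribute value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def record_similarity(r1: Dict[str, str], r2: Dict[str, str]) -> float:
    """Average token overlap over the reference attributes both records share."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    return sum(jaccard(tokens(r1[a]), tokens(r2[a])) for a in shared) / len(shared)


def same_product(r1: Dict[str, str], r2: Dict[str, str],
                 threshold: float = 0.5) -> bool:
    """Predict r1.U == r2.U with a fixed, hypothetical similarity threshold."""
    return record_similarity(r1, r2) >= threshold


# Attribute values quoted in the text for Figures 1 and 2.
record_fig1 = {
    "light sensitivity":
        "Auto, High ISO, ISO 80/100/200/400/8000/1600, equivalent",
}
record_fig2 = {"light sensitivity": "Auto ISO 80/100/200/400/800/1600"}

print(record_similarity(record_fig1, record_fig2))  # 7 shared of 11 tokens
print(same_product(record_fig1, record_fig2))
```

A fixed threshold like this would have to be tuned by hand, which is the kind of supervision the framework described here is meant to avoid; the sketch only shows the shape of the pairwise decision.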
Similar resources
An unsupervised method for joint information extraction and feature mining across different Web sites
We develop an unsupervised learning framework which can jointly extract information and conduct feature mining from a set of Web pages across different sites. One characteristic of our model is that it allows tight interactions between the tasks of information extraction and feature mining. Decisions for both tasks can be made in a coherent manner leading to solutions which satisfy both tasks a...
Exploiting Secondary Sources for Unsupervised Record Linkage
XML, Web services, and the Semantic Web have opened the door for new and exciting information integration applications. Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities from these sources....
Minimally-Supervised Attribute Fusion for Data Lakes
Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other c...
Unsupervised Extraction of Popular Product Attributes from Web Sites
We develop an unsupervised learning framework for extracting popular product attributes from different Web product description pages. Unlike existing systems which do not differentiate the popularity of the attributes, we propose a framework which is able not only to detect concerned popular features of a product from a collection of customer reviews, but also to map these popular features to t...
Positioning of Industries in Cyberspace Evaluation of Web Sites Using Correspondence Analysis
In today’s extremely competitive markets it is crucial for companies to strategically position their brands, products and services relative to their competitors. With the emerging trend in internationalization of companies especially SME’s and the growing use of the Internet with this regard, great amount of attention has been turned to effective involvement of the Internet channel in the mar...